The "Reverse Nearest Neighbor" query finds applications in decision support systems, profile-based
marketing, emergency services, etc. In this paper, we point out a few flaws in the branch-and-bound
algorithms proposed earlier for computing monochromatic RkNN queries over data points stored in a
hierarchical index. We give suitable counter-examples to validate our claims and propose a correct
algorithm for the corresponding problem. We show that our algorithm is correct by identifying
necessary conditions behind the correctness of algorithms for this problem.


1. INTRODUCTION 

One important type of operation that is gaining popularity in the database and data-mining
research community is the Reverse Nearest Neighbor Query (RkNN) [Korn and Muthukrishnan 2000].
Given a set of database objects O and a query object Q, the RkNN query returns those objects
in O for which Q is one of their k nearest neighbors; here the notion of neighborhood is with
respect to an appropriately defined notion of distance between the objects. A classic example
of RkNN is in the domain of decision support systems, where the task is to open a new facility
(like a restaurant) in an area such that it will be least influenced by its competitors and
attract good business. Another application is profile-based marketing [Korn and Muthukrishnan
2000], where a company maintains profiles of its customers and wants to start a new service
which can attract the maximum number of customers. RkNN also has applications in clustering,
where a cluster could be created by identifying a group of objects and clustering them around
their common nearest neighbor point; this essentially involves finding cluster centers with
high cardinality of reverse nearest neighbor sets. Reciprocal nearest neighborhood, in which
data points which are nearest neighbors of each other are clustered together (and therefore
satisfy both nearest neighbor and reverse nearest neighbor criteria), is another well-known
technique in clustering [Lopez-Sastre et al. 2012].
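To make the RkNN definition concrete, the following is a minimal brute-force sketch for points in
the plane under Euclidean distance. The function name `rknn_bruteforce` and the sample data are
illustrative assumptions, not part of any of the algorithms discussed in this paper; ties are
resolved in favour of database points.

```python
from math import dist


def rknn_bruteforce(points, q, k):
    """Return indices i such that q is one of the k nearest neighbors of points[i].

    Brute force: for every database point, compare its distance to q with the
    distance to its k-th nearest neighbor among the other database points.
    """
    result = []
    for i, p in enumerate(points):
        to_others = sorted(dist(p, o) for j, o in enumerate(points) if j != i)
        # With fewer than k other points, q is trivially among the k nearest.
        kth = to_others[k - 1] if len(to_others) >= k else float("inf")
        # Strict '<': ties are resolved in favour of database points (assumption).
        if dist(p, q) < kth:
            result.append(i)
    return result


points = [(0, 0), (1, 0), (4, 0), (10, 0)]
print(rknn_bruteforce(points, (5, 0), 1))  # -> [2, 3]
```

This costs a full scan per database point, which is exactly the work that index-based
branch-and-bound algorithms try to avoid.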

This important concept has seen a series of remarkable applications and algorithms
for processing different types of objects, in various contexts and under variations [Kang
et al. 2007], [Safar et al. 2009], [Tran et al. 2009], [Taniar et al. 2011], [Shang et al.
2011], [Cheema et al. 2012], [Ghaemi et al. 2012], [Li et al. 2013], [Emrich et al. 2014],
[Cabello et al. 2010], [Bhattacharya and Nandy 2013] of the problem parameters. The
focus of this paper is monochromatic RkNN queries; in this version, all objects in the
database and the query belong to the same category, unlike the bichromatic version
in which the objects can belong to different categories. Furthermore, we want to focus
on queries where k is specified as part of a query, and want to support objects from an
arbitrary metric space.

This paper points out several fundamental inaccuracies in three papers published 
earlier on the problem mentioned above. 


— Reverse k-nearest neighbor search in dynamic and general metric databases [Achtert
et al. 2009]

— Reverse spatial and textual k nearest neighbor search [Lu et al. 2011]

— Efficient algorithms and cost models for reverse spatial-keyword k-nearest neighbor
search [Lu et al. 2014]

Achtert et al. [Achtert et al. 2009] proposed a branch-and-bound algorithm for the
above problem which could use any given hierarchical tree-like index on data from
any metric space. Lu et al. [Lu et al. 2011] proposed a similar algorithm, but specifically
optimized for spatio-textual data, for answering RSTkNN queries using a specialized
IUR-tree as the indexing structure. In a follow-up paper [Lu et al. 2014], they
proposed an improvement of their algorithm (including correcting an error) and a
theoretical cost model to analyze the efficiency of their algorithm. However, we observed
several deficiencies in the algorithms mentioned above. In this paper we point out
those inaccuracies and discuss them more formally by identifying some key properties
which these algorithms violate, but which are necessary for ensuring correctness of
these and other similar algorithms. We present detailed counter-examples and
suggest corrective modifications to these algorithms. Finally, we propose a correct
algorithm for performing RkNN queries over a hierarchical index and also present its
proof of correctness.

The paper is organized as follows. In Section 2 we explain the three published approaches
mentioned above in which we found inaccuracies. In Section 3 we describe our
counter-examples with respect to them. We present our modified algorithm in Section 4
and its proof of correctness in Section 4.6.

2. EARLIER RESULTS 

The underlying algorithms for all three approaches mentioned above essentially have
the same structure and follow a branch-and-bound approach. The first work is applicable
to any kind of data with a distance measure that is a metric, and uses any
hierarchical tree-like index built on the data. The latter two works are specifically
concerned with RkNN queries on spatio-textual data, which they refer to as RSTkNN queries.

Fig. 1: Example for illustrating RSTkNN and RkNN


In RSTkNN, each object is represented by a pair (loc, vct) where loc is the spatial
location and vct is the associated textual description, which is represented by
(word, weight(word)) pairs for all words appearing in the database. The weight of a word
is calculated on the basis of the TF-IDF scheme [Salton and Buckley 1988]. Spatio-textual
similarity (SimST) is defined by [Lu et al. 2011] as follows:

$$SimST(o_1, o_2) = \alpha \left(1 - \frac{dist(o_1.loc, o_2.loc) - \psi_s}{\varphi_s - \psi_s}\right) + (1 - \alpha)\, \frac{EJ(o_1.vct, o_2.vct) - \varphi_t}{\psi_t - \varphi_t} \tag{1}$$

The parameter $\alpha$ is used to define the relevance factor for spatial and textual similarity
while calculating the total similarity scores and is specified in a query. $\psi_s$ and $\varphi_s$
denote the minimum and maximum distance between any two objects in the database and
are used to normalize the spatial similarity to the range [0,1]. Similarly, $\varphi_t$ and $\psi_t$
denote the minimum and maximum textual similarity between any two objects in the
database. dist(·,·) is the Euclidean distance between $o_1$ and $o_2$, and EJ is the Extended
Jaccard similarity [Tan and Steinbach 2011], defined as:


$$EJ(o_1.vct, o_2.vct) = \frac{\sum_{j=1}^{n} o_1.w_j \cdot o_2.w'_j}{\sum_{j=1}^{n} (o_1.w_j)^2 + \sum_{j=1}^{n} (o_2.w'_j)^2 - \sum_{j=1}^{n} o_1.w_j \cdot o_2.w'_j} \tag{2}$$

where $o_1.vct = (w_1, \ldots, w_n)$ and $o_2.vct = (w'_1, \ldots, w'_n)$.
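The two similarity measures can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: objects are `(loc, vct)` pairs, the normalization
constants are passed in as the hypothetical parameters `d_min`, `d_max`, `t_min`, `t_max`
(smallest and largest pairwise distance and textual similarity in the database), and two
all-zero vectors are given similarity 0 by convention.

```python
from math import dist


def extended_jaccard(v, w):
    """Extended Jaccard similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    denom = sum(a * a for a in v) + sum(b * b for b in w) - dot
    return dot / denom if denom else 0.0  # convention for two zero vectors


def sim_st(o1, o2, alpha, d_min, d_max, t_min, t_max):
    """Weighted combination of normalized spatial and textual similarity."""
    (loc1, vct1), (loc2, vct2) = o1, o2
    spatial = 1.0 - (dist(loc1, loc2) - d_min) / (d_max - d_min)
    textual = (extended_jaccard(vct1, vct2) - t_min) / (t_max - t_min)
    return alpha * spatial + (1.0 - alpha) * textual


a = ((0.0, 0.0), (1.0, 0.0))
b = ((3.0, 4.0), (1.0, 0.0))
print(sim_st(a, b, alpha=0.5, d_min=0.0, d_max=10.0, t_min=0.0, t_max=1.0))
```

With `alpha = 1` only the locations matter, and with `alpha = 0` only the textual vectors do.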

As an example, consider Figure 1. There, considering only location attributes, and
for k = 2, the RkNN of Q are objects P_3 and P_4. However, if we consider both spatial and
textual similarity, and take k = 2 and $\alpha$ = 0.4, the RSTkNN of Q are P_2, P_3 and P_4.

Now we will describe the actual algorithm proposed by [Lu et al. 2011] for RSTkNN.
It is important to present it in some detail, since this is required for a proper appreciation
of the inaccuracies in this algorithm. The algorithm requires its data to be organized
in a hierarchical index called the IUR-tree. An IUR-tree is an R-Tree [Guttman 1984]
in which every node is augmented with an Intersection Vector and a Union Vector. These
textual vectors contain a weight for every distinct item in the documents contained
in the node. The weight of every item in the Intersection Vector (resp. Union Vector)
is the minimum weight (resp. maximum weight) of that item over all the documents
contained in the node. During the execution of the algorithm, a lower and an
upper nearest-neighbor list (contribution list) is created and maintained for each node
in the IUR-tree. The lower (resp. upper) contribution list stores the minimum (resp.
maximum) similarity between the node and its neighbors.
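As an illustration of the intersection and union vectors, the sketch below computes them for a
group of documents and shows that a parent's vectors can be obtained from its children's vectors
alone. The helper names `summarize` and `merge` are hypothetical and are not taken from the
IUR-tree papers.

```python
def summarize(vectors):
    """Intersection (coordinate-wise min) and union (coordinate-wise max) vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def merge(summaries):
    """Summary of a parent node from the (intersection, union) pairs of its children."""
    inters, unions = zip(*summaries)
    return summarize(inters)[0], summarize(unions)[1]


docs_a = [(1, 0, 3), (2, 5, 1)]   # term weights of the documents in one child
docs_b = [(0, 4, 2)]              # documents in a sibling
parent = merge([summarize(docs_a), summarize(docs_b)])
print(parent)  # -> ([0, 0, 1], [2, 5, 3])
```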
Fig. 2: IUR-Tree and Textual Vectors of Fig. 1

Algorithm 1 RSTkNN (R: IUR-Tree root, Q: query) from [Lu et al. 2011]
Output: All objects o, s.t. o ∈ RSTkNN(Q, k, R).

Initialize a priority queue U, and lists COL, ROL, PEL;
EnQueue(U, R);
while U is not empty do
    P ← DeQueue(U); //Priority of U is MaxST(P, Q)
    for each child node E of P do
        Inherit(E.CLs, P.CLs);
        if IsHitOrDrop(E, Q) == false then
            for each node E' in COL, ROL, U do //see subsection 3.2
                UpdateCL(E, E'); //update contribution lists of E
                if IsHitOrDrop(E, Q) == true then //see subsection 3.3
                    break;
                end if
                if E' ∈ U ∪ COL then
                    UpdateCL(E', E); //update contribution lists of E' using E
                    if IsHitOrDrop(E', Q) == true then
                        Remove E' from U or COL;
                    end if
                end if
            end for
            if E is not a hit or drop then
                if E is an index node then
                    EnQueue(U, E);
                else
                    COL.append(E); //a database object
                end if
            end if
        end if
    end for
end while
Final_Verification(COL, PEL, Q);


The IUR-Tree and the Intersection and Union Vectors of the corresponding nodes are
shown in Figure 2. These vectors, along with the MBRs of nodes, are used to compute
the similarity approximations, i.e., upper and lower bounds on the spatio-textual
similarity between two groups of objects.

We refer to an internal node or a point in the IUR-Tree as an entry. The algorithm
takes as input an IUR-Tree (Intersection Union tree) R and a query Q, and returns all
database objects which are RSTkNN of Q. The data structures used are: a priority
queue (U) sorted in decreasing order of MaxST(E, Q), a result list (ROL), a pruned list
(PEL) and a candidate list (COL). MaxST(E, Q) is the maximum spatial-textual similarity
of the entry E with the query point Q. The algorithm dequeues the root of the
IUR-Tree from the queue, and every child E of the root inherits the contribution
list of its parent. The function UpdateCL(E, E') is invoked and the contribution list of
E is updated with every E' present in the candidate list, result list and the priority
queue. After every invocation of UpdateCL(·), the algorithm checks, based on the minimum
and maximum bounds on the similarity with the k-th nearest neighbor, whether
to add E to the result, candidate or pruned list. If E can't be pruned or added to the
results, the contribution list of E' is updated with E. This process is called the mutual
effect. If E' can be added to the results or pruned, it is removed from the queue or COL.

function Final_Verification(COL, PEL, Q)
    while COL ≠ ∅ do
        Let E be an entry in PEL with the lowest level;
        PEL = PEL − {E};
        for each object o in COL do
            UpdateCL(o, E); //update contribution lists of o
            if IsHitOrDrop(o, Q) == true then //see subsection 3.3
                COL = COL − {o};
            end if
        end for
        for each child node E' of E do
            PEL = PEL ∪ {E'}; //access the children of E
        end for
    end while
end function


After updating node E with all entries of COL, ROL or U, the function IsHitOrDrop()
is invoked again. If E can't be added to the result or pruned list, a check is performed
to find out whether it is an internal node or a point. If E is an internal node, it is added
to the queue; otherwise it is added to the candidate list. When the queue becomes empty,
there might be some objects left in the candidate list. The function Final_Verification()
is then invoked, in which the candidate objects are updated with all the entries present
in PEL to decide whether they belong to the result or not.


3. COUNTER-EXAMPLES 

We describe three counter-examples in this section:

(1) Inaccuracy regarding computation of MinT and MaxT

(2) Inaccuracy w.r.t. Locality Condition

(3) Inaccuracy w.r.t. Completeness Condition

All these examples are illustrated with respect to the algorithm described in [Lu et al.
2011]; however, we also explain the concepts used in constructing these examples,
so that they can be easily modified to suit the other algorithms. We observed
that [Achtert et al. 2009] proposed an algorithm which maintains the locality
condition but violates the completeness condition. We recently observed that [Lu et al.
2014] modified their previous algorithm from [Lu et al. 2011], which now maintains the
locality condition. However, their algorithm still violates the completeness condition.


3.1. Inaccuracy regarding computation of MinT and MaxT 

The branch-and-bound algorithm presented in [Lu et al. 2011] required cleverly
constructed lower and upper bounds on the textual similarity (and combined
textual-spatial similarity) between two groups of data objects. Its authors defined MinT
(minimum possible similarity) and MaxT (maximum possible similarity) and claimed that
these definitions, when used in conjunction with upper and lower bounds on spatial
similarity, give valid upper and lower bounds on the similarity between two groups of
objects. To prove this claim, they used the following crucial lemma. The first inaccuracy
we report is regarding this lemma.


Definition 3.1 (Similarity Preserving Function). [Lu et al. 2011] Given two functions
$fsim : V \times V \to \mathbb{R}$ and $fdim : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, where V denotes the domain of
n-element vectors and $\mathbb{R}$ the real numbers, fsim is a similarity preserving function w.r.t.
fdim if, for any three vectors $p = (x_1, \ldots, x_n)$, $p' = (x'_1, \ldots, x'_n)$, $p'' = (x''_1, \ldots, x''_n)$
with $fdim(x_i, x'_i) \ge fdim(x_i, x''_i)$ for all $i \in [1, n]$, we have $fsim(p, p') \ge fsim(p, p'')$.

LEMMA 3.2. [Lu et al. 2011] Extended Jaccard is a similarity preserving function w.r.t. the
function $fdim(x, x') = \frac{\min(x, x')}{\max(x, x')}$ for $x, x' > 0$.


Counter Example. Consider three points p, p', p'' with textual vectors p = (100, 30),
p' = (1, 40), p'' = (1, 50). Using fdim(·,·) as defined in Lemma 3.2, observe that the
given points satisfy the conditions for a similarity preserving function, i.e., for all
$i \in [1, 2]$,

$$\frac{\min(x_i, x'_i)}{\max(x_i, x'_i)} \ge \frac{\min(x_i, x''_i)}{\max(x_i, x''_i)}.$$

Indeed, both ratios equal 1/100 in the first coordinate, and 30/40 = 0.75 ≥ 30/50 = 0.6 in
the second. However, EJ(p, p') = 0.116 < EJ(p, p'') = 0.135, which violates the requirement
of Definition 3.1 and hence contradicts Lemma 3.2. The MinT and MaxT formulae given in the
paper relied on the above lemma being correct, and therefore become invalid.
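The arithmetic of this counter-example is easy to check numerically. The short script below is
only a verification aid; the helper names are ours.

```python
def extended_jaccard(v, w):
    """Extended Jaccard similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sum(a * a for a in v) + sum(b * b for b in w) - dot)


def fdim(x, y):
    """Per-coordinate function from Lemma 3.2 (defined for positive x, y)."""
    return min(x, y) / max(x, y)


p, p1, p2 = (100, 30), (1, 40), (1, 50)

# Premise of Definition 3.1: p1 is coordinate-wise at least as "close" to p as p2.
premise = all(fdim(a, b) >= fdim(a, c) for a, b, c in zip(p, p1, p2))

ej1 = extended_jaccard(p, p1)   # 1300 / 11201, about 0.116
ej2 = extended_jaccard(p, p2)   # 1600 / 11801, about 0.136
print(premise, round(ej1, 3), round(ej2, 3), ej1 >= ej2)
```

The premise holds while the conclusion `ej1 >= ej2` fails, which is exactly the violation
described above.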

We now present our approach to calculate MinT and MaxT between two groups of
textual objects E and E'. As explained earlier, every textual object is represented as
a vector of term frequencies. For any group of objects, their intersection vector (resp.
union vector) has been defined to be a vector whose every coordinate is the minimum
(resp. maximum) frequency among the corresponding coordinates of the objects. Denoting
the intersection and union vectors of E as $(E.i_1, E.i_2, \ldots)$ and $(E.u_1, E.u_2, \ldots)$, notice
that for every $o \in E$ and $j \in [1, n]$, $E.i_j \le o.w_j \le E.u_j$. We propose the following
formula for MinT.

$$MinT(E, E') = \frac{\sum_{j=1}^{n} E.i_j \cdot E'.i_j}{\sum_{j=1}^{n} (E.u_j)^2 + \sum_{j=1}^{n} (E'.u_j)^2 - \sum_{j=1}^{n} E.i_j \cdot E'.i_j} \tag{3}$$

The idea for computing MinT is that, since it is a lower bound, we want to minimize
the term in the numerator and maximize the denominator of EJ to ensure that for all $o \in E$
and all $o' \in E'$, $EJ(o, o') \ge MinT(E, E')$. Similarly, the formula for MaxT is given below:


$$MaxT(E, E') = \frac{\sum_{j=1}^{n} E.u_j \cdot E'.u_j}{\sum_{j=1}^{n} (E.i_j)^2 + \sum_{j=1}^{n} (E'.i_j)^2 - \sum_{j=1}^{n} E.u_j \cdot E'.u_j} \tag{4}$$
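The bounding idea can be sanity-checked empirically. The sketch below assumes non-negative
weights, takes MinT with intersection vectors in the products and union vectors in the squared
norms, and takes MaxT with the roles swapped. Clamping MaxT to 1 when its denominator is not
positive is our own assumption (Extended Jaccard never exceeds 1), not something stated in the
text; all function names are hypothetical.

```python
import random


def extended_jaccard(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sum(a * a for a in v) + sum(b * b for b in w) - dot)


def summarize(vectors):
    """Intersection (min) and union (max) vectors of a group."""
    cols = list(zip(*vectors))
    return [min(c) for c in cols], [max(c) for c in cols]


def min_t(i1, u1, i2, u2):
    num = sum(a * b for a, b in zip(i1, i2))               # smallest dot product
    den = sum(a * a for a in u1) + sum(b * b for b in u2) - num
    return num / den if den > 0 else 0.0


def max_t(i1, u1, i2, u2):
    num = sum(a * b for a, b in zip(u1, u2))               # largest dot product
    den = sum(a * a for a in i1) + sum(b * b for b in i2) - num
    # Assumption: fall back to the trivial bound 1 when the ratio is unusable.
    return min(1.0, num / den) if den > 0 else 1.0


def bounds_hold(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        g1 = [[rng.uniform(0.1, 5) for _ in range(3)] for _ in range(4)]
        g2 = [[rng.uniform(0.1, 5) for _ in range(3)] for _ in range(4)]
        (i1, u1), (i2, u2) = summarize(g1), summarize(g2)
        lo, hi = min_t(i1, u1, i2, u2), max_t(i1, u1, i2, u2)
        for a in g1:
            for b in g2:
                if not lo - 1e-12 <= extended_jaccard(a, b) <= hi + 1e-12:
                    return False
    return True


print(bounds_hold())
```

A passing random test is of course not a proof; it only guards against slips in the algebra.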


3.2. Inaccuracy w.r.t. Locality Condition 



Fig. 3: Counter-example (Locality and Completeness conditions). (a) Distribution of points;
(b) IUR-Tree.

Consider the following counter-example for the dataset and IUR-Tree illustrated in
Figure 3, and let $\alpha$ = 1 and k = 2. The minimum and maximum distance between any
two points in the database are $\psi_s$ = 7.07 and $\varphi_s$ = 142.21. The exact RSTkNN of the query
point Q is P_0 and P_1. The trace of the algorithm [Lu et al. 2011] is shown in Table I. We
will focus on step 1 here. The root of the tree is dequeued from the queue and node N_1
is processed. N_1 inherits the contribution lists of its parent, which are empty. Since U,
ROL and COL are empty, N_1 is simply added to the queue. Now, node N_2 is processed.
N_2 updates its upper and lower contribution lists with N_1 and invokes IsHitOrDrop.
The upper and lower contribution lists of N_2 upon invoking IsHitOrDrop are:

N_2^L.CL = {(N_1, 0, 2)}

N_2^U.CL = {(N_1, 0.68, 2)}

Since MinST(N_2, Q) = 0.73, which is more than the upper bound given by N_2^U.CL, at
this point node N_2 is accepted (wrongly) as RSTkNN of Q.


Table I: Trace of RSTkNN Algorithm (2011)

Step  Actions                      U     COL  ROL                           PEL
----  ---------------------------  ----  ---  ----------------------------  ---
1     Dequeue Root, Enqueue N_1    N_1   ∅    P_2, P_3, P_4, P_5            ∅
2     Dequeue N_1                  ∅     ∅    P_0, P_1, P_2, P_3, P_4, P_5  ∅


We attribute this fault to the violation of the Locality Condition, a property that, we
claim, must be satisfied by these algorithms.

Locality Condition. Nearest neighbors of data points in a node may belong to the
node itself; hence, every node should compute similarity with itself and include itself
as a candidate (along with other similar nodes) in any test to prune or accept the node
as RSTkNN of Q.

In the counter-example above, node N_2 does not satisfy this condition since its
contribution lists do not contain itself or points inside it.


3.3. Inaccuracy w.r.t. Completeness Condition 


The trace of the algorithm [Lu et al. 2014] on the same example (Figure 3) is shown in Table II.


Table II: Trace of RSTkNN Algorithm (2014)

Step  Actions                                   U         COL  ROL                 PEL
----  ----------------------------------------  --------  ---  ------------------  ---
1     Dequeue Root, Enqueue N_1, Enqueue N_2    N_1, N_2  ∅    ∅                   ∅
2     Dequeue N_2                               N_1       ∅    P_2, P_3            N_4
3     Dequeue N_1                               ∅         ∅    P_0, P_1, P_2, P_3  N_4

We will now focus on Step 2, when node N_2 is dequeued from the priority queue
and its children are being processed. Node N_3 is processed first and it inherits
the contribution lists of its parent N_2. The function IsHitOrDrop is called, but N_3
can't be pruned or added to the results. After the invocation of IsHitOrDrop, N_3 updates
its contribution list with itself to maintain the locality condition. N_3 further updates
its contribution list with the other entries present in COL, ROL and U, sorted in
decreasing order of their maximum spatio-textual similarity with N_3. The upper and
lower contribution lists of N_3 are shown below:

N_3^L.CL = {(N_3, 0.94, 1), (N_1, 0, 2)}

N_3^U.CL = {(N_3, 1, 1), (N_1, 0.68, 2)}

Since MinST(N_3, Q) = 0.90, which is more than the upper bound 0.68 given by N_3^U.CL,
at this point N_3 is accepted (wrongly) as RSTkNN of Q. We claim that this faulty behaviour
is due to not ensuring the Completeness Condition, viz., the absence of N_4 in the
contribution lists of N_3. This condition is discussed in more detail in Section 4.3. In this
example, the contribution lists of N_3 are not complete.

4. PROPOSED RSTkNN QUERY ALGORITHM

In this section, we present a modified algorithm to answer RSTkNN queries. We will
illustrate our algorithm with an example, pointing out the modifications, and end this
section with a formal proof of correctness. We begin by formalizing some notions which
will be used in the algorithm and will be crucial in ensuring its correctness.

As explained earlier, the algorithms we considered work on data stored
in a hierarchical tree-like index, where the leaf nodes are the data points themselves
(represented by small letters) and internal nodes (represented by CAPITAL letters)
contain pointers to children nodes. Our modified algorithm shares the backbone
of these algorithms; structurally, it most closely resembles the algorithm
presented in [Lu et al. 2011; Lu et al. 2014]. However, it will be presented in a
generalized manner which can be used to perform RkNN queries, given any value of k,
on a wide variety of data and independent of the explicit indexing structure used. The
only requirements on the data and the index are a similarity measure Sim(·,·) among
the data points, information about the number of objects in each node, and estimates
MinSim and MaxSim among nodes (explained below).


4.1. Contribution List a.k.a. NN-list

We will use the following notation: if e' is the k-th nearest neighbor of e, then we will
write e' as kNN(e). We will use the convention that a point is the 0-th nearest neighbor
of itself. An immediate observation is the following: $Sim(e, kNN(e)) \ge Sim(e, k'NN(e))$
for any k' > k.

One way to answer RkNN queries is by computing the list of nearest neighbors
(NN-list) for every data point e: NN(e) is an ordered list of data points $(e_1, e_2, e_3, \ldots)$
such that $e_1$ is 1NN(e), $e_2$ is 2NN(e) and so on. Computing this list explicitly for every
data point could be very inefficient. The usual approach followed by branch-and-bound
algorithms like [Achtert et al. 2009; Lu et al. 2011; Lu et al. 2014] is to search
the index top-down while maintaining two NN-lists with each node: one contains
an overestimate of its nearest neighbors, and the other an underestimate
of the same. These estimated lists are constructed using two functions MinSim(·,·)
and MaxSim(·,·) which must satisfy the properties below. The actual implementation of
these functions depends crucially on the type of data used and the index. For two nodes
E and E',


— MinSim(E, E') must give a lower bound for the minimum similarity between pairs
of points from E and E', i.e., $\forall e \in E, \forall e' \in E'$, $Sim(e, e') \ge MinSim(E, E')$.

— MaxSim(E, E') must give an upper bound for the maximum similarity between
pairs of points from E and E', i.e., $\forall e \in E, \forall e' \in E'$, $Sim(e, e') \le MaxSim(E, E')$.
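As one possible instance of these two functions, consider purely spatial data whose nodes are
summarized by axis-aligned bounding rectangles. The sketch below assumes the similarity
Sim(e, e') = 1 - dist(e, e') / d_max, where `d_max` is a hypothetical normalization constant;
rectangles are written as `(x_lo, y_lo, x_hi, y_hi)`. Other data types and indexes would need
their own bounds.

```python
from math import hypot


def min_dist(r1, r2):
    """Smallest possible distance between a point of r1 and a point of r2."""
    dx = max(r1[0] - r2[2], r2[0] - r1[2], 0.0)
    dy = max(r1[1] - r2[3], r2[1] - r1[3], 0.0)
    return hypot(dx, dy)


def max_dist(r1, r2):
    """Largest possible distance between a point of r1 and a point of r2."""
    dx = max(r1[2] - r2[0], r2[2] - r1[0])
    dy = max(r1[3] - r2[1], r2[3] - r1[1])
    return hypot(dx, dy)


def min_sim(r1, r2, d_max):
    """Lower bound on similarity: the farthest pair is the least similar."""
    return 1.0 - max_dist(r1, r2) / d_max


def max_sim(r1, r2, d_max):
    """Upper bound on similarity: the closest pair is the most similar."""
    return 1.0 - min_dist(r1, r2) / d_max


r1, r2 = (0, 0, 1, 1), (3, 0, 4, 1)
print(min_sim(r1, r2, 10.0), max_sim(r1, r2, 10.0))
```

Any pair of points drawn from the two rectangles then has a similarity between the two printed
values, which is the property required above.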

Next, we will define the main component of our algorithm, a formalization of the
contribution lists (CL) used in earlier algorithms.

Definition 4.1 (NN-list). An NN-list of a node E is a list of tuples
$((E_1, m_1), (E_2, m_2), \ldots)$, where each $E_i$ is a node and $m_i$ is a positive integer.

The NN-lists we will maintain per node are NN_U(E) and NN_L(E), whose tuples will
provide estimates of the similarity of E to its r-th nearest neighbor, for various values
of r.


4.2. Lower bound list NN_L

The central idea behind the NN_L list comes from the following observation. Suppose
for a set of m points $\{e'_1, e'_2, \ldots, e'_m\}$ and another point e, we have that $Sim(e, e'_i) \ge s$
for all i. Then, it is obvious that if e does not belong to this set, $Sim(e, mNN(e)) \ge s$; and
if e belongs to this set, then $Sim(e, (m-1)NN(e)) \ge s$. Extending this concept to
nodes, consider any node E with m data points; now, if $MinSim(E, e) \ge s$ then
$Sim(e, mNN(e)) \ge s$ if $e \notin E$, and $Sim(e, (m-1)NN(e)) \ge s$ if $e \in E$. Notice that
these bounds are tight.

We can even extend this idea to multiple nodes to get the following claim. Let e be
a data point and $E_1, E_2, \ldots, E_k$ be a collection of non-overlapping nodes which do not
contain e, where the list is sorted in decreasing order of $MinSim(E_i, e)$. Let $m_i$ denote
the number of data points in $E_i$, and let $s_i$ be a lower bound on $MinSim(E_i, e)$. Then,
for all $j = 1 \ldots k$, $Sim(e, (\sum_{i=1}^{j} m_i)NN(e)) \ge s_j$. If $e \in E_i$ for some i, then $m_i$ must be
replaced with $m_i - 1$. We can generalize this even further by considering a node instead
of e.

Definition 4.2 (Lower NN-list). An NN-list $((E_1, m_1), \ldots)$ of non-overlapping nodes
is a valid NN_L(E) if:

— the list is sorted in decreasing order of $MinSim(E_i, E)$

— for all i, if E does not overlap with $E_i$, then $m_i \le |E_i|$, and if E overlaps with $E_i$,
then $m_i \le |E_i| - 1$

The following lemma describes the use of lower NN-lists to get underestimates of
nearest neighbors. The proof is immediate from the earlier definitions.

LEMMA 4.3. For any t and i that satisfy $\sum_{k=1}^{i-1} m_k < t \le \sum_{k=1}^{i} m_k$ (including the
case $t \le m_1$, $i = 1$), it holds that for all $e \in E$, $Sim(e, tNN(e)) \ge MinSim(e, E_i)$.
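The lookup that Lemma 4.3 describes can be sketched as a search over running counts. The code
below is a toy check on points of a line with the similarity Sim(a, b) = -|a - b|; the function
name `lower_bound_for_rank` and the data are illustrative assumptions. Note that the list is
deliberately incomplete (the point 3 belongs to no listed node), which is allowed for lower
NN-lists.

```python
from bisect import bisect_left
from itertools import accumulate


def lower_bound_for_rank(nn_lower, t):
    """nn_lower: [(min_sim, count), ...] sorted by decreasing min_sim.

    Returns the bound of the first entry i with t <= m_1 + ... + m_i,
    or None when the list covers fewer than t points.
    """
    cumulative = list(accumulate(count for _sim, count in nn_lower))
    i = bisect_left(cumulative, t)
    return nn_lower[i][0] if i < len(nn_lower) else None


def sim(a, b):
    return -abs(a - b)   # toy similarity: negated distance


e = 0
E1, E2, stray = [1, 2], [5, 6, 7], [3]
nn_lower = [(min(sim(e, x) for x in E1), len(E1)),
            (min(sim(e, x) for x in E2), len(E2))]

# True similarities of e to its 1st, 2nd, ... nearest neighbors.
true_sims = sorted((sim(e, x) for x in E1 + E2 + stray), reverse=True)
lemma_holds = all(true_sims[t - 1] >= lower_bound_for_rank(nn_lower, t)
                  for t in range(1, 6))
print(nn_lower, lemma_holds)
```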


4.3. Upper bound list NN_U

We want to define NN_U as an overestimation of nearest neighbors similar to NN_L and
derive a lemma similar to Lemma 4.3; however, we require an additional concept first.

Definition 4.4 (Complete NN-list). We say that an NN-list NN(E) is complete if every
data point is present in some node in the NN-list, and for every $(E_i, m_i)$ in the
list,

— if E does not overlap with $E_i$, then $m_i = |E_i|$

— if E overlaps with $E_i$, then $m_i = |E_i| - 1$

It must be noted that an NN_L list need not be complete for it to satisfy Lemma
4.3. However, similar arguments do not work for NN_U. Take, for example, a
situation similar to the one described for NN_L: we have a set of points $\{e'_1, e'_2, \ldots, e'_m\}$
and another point e (all distinct). Even if we know that $Sim(e, e'_i) \le s$ for some s
and for all i, it is nevertheless not true that $Sim(e, mNN(e)) \le s$ unless all points
other than e are in the set, which is precisely what a complete NN-list specifies.
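A tiny numeric instance makes the failure visible. The sketch uses points on a line with the toy
similarity Sim(a, b) = -|a - b|; the data are made up for illustration.

```python
def sim(a, b):
    return -abs(a - b)   # toy similarity: negated distance on a line


def kth_nn_sim(e, others, k):
    """Similarity of e to its k-th nearest neighbor among `others`."""
    return sorted((sim(e, x) for x in others), reverse=True)[k - 1]


e = 0
listed = [10, 11]     # the m = 2 points that the list knows about
unlisted = [1, 2]     # points missing from the (incomplete) list

s = max(sim(e, x) for x in listed)               # Sim(e, e'_i) <= s for listed points
true_2nn = kth_nn_sim(e, listed + unlisted, 2)   # real 2nd nearest neighbor

print(s, true_2nn, true_2nn <= s)  # -> -10 -2 False
```

The claimed upper bound s = -10 is violated because the two unlisted points are closer to e than
anything in the list; if the listed points were the whole database, the bound would hold.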

Now we can define concepts similar to those for NN_L.


Definition 4.5 (Upper NN-list). For a node E, an NN-list $((E_1, m_1), \ldots)$ of
non-overlapping nodes is a valid NN_U(E) when the following holds:

— the list is sorted in decreasing order of $MaxSim(E_i, E)$

— the list is complete

Observe that the completeness condition requires that NN_U(E) must contain E itself,
or its parent node, or all its children nodes; this is essentially the locality
condition we mentioned earlier (Section 3.2). We have nevertheless chosen to highlight
the locality condition separately from the more general completeness condition.
The main working lemma for NN_U follows next.

LEMMA 4.6. For any t and i such that $\sum_{k=1}^{i-1} m_k < t \le \sum_{k=1}^{i} m_k$ (including the case
when $i = 1$ and $t \le m_1$), it holds that for all $e \in E$, $Sim(e, tNN(e)) \le MaxSim(e, E_i)$.

4.4. Branch-and-bound traversal 

A branch-and-bound algorithm traverses a hierarchical index by first visiting the root,
and then exploring its children nodes, and so on. For every node it visits, the algorithm
decides what to do next based on some estimate of the relevance of the current node
to the desired answer (here, the NN_U and NN_L lists). It may choose to further explore the
node; or add all the points in the node to the result set and not explore the node further
(a.k.a. accepting the node); or simply not explore the node further because it decided
that the node does not contain any point that should be in the result set (a.k.a. pruning
the node).
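The three possible decisions can be illustrated on a deliberately simple problem. The sketch
below runs a breadth-first branch-and-bound over a toy tree of one-dimensional points and answers
a range query instead of an RkNN query; the node layout (dictionaries with `lo`, `hi`, `children`
or `point`) and all names are assumptions made for this illustration only.

```python
from collections import deque


def leaf_points(node):
    """All data points stored below a node (a leaf stores exactly one point)."""
    if "point" in node:
        return [node["point"]]
    return [p for child in node["children"] for p in leaf_points(child)]


def traverse(root, decide):
    """decide(node) returns 'accept', 'prune' or 'expand'."""
    accepted, visited = [], 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        visited += 1
        verdict = decide(node)
        if verdict == "accept":
            accepted.extend(leaf_points(node))        # take the whole subtree
        elif verdict == "expand":
            queue.extend(node.get("children", []))    # look closer
        # 'prune': the subtree is dropped without being visited
    return accepted, visited


def leaf(x):
    return {"lo": x, "hi": x, "point": x}


def inner(*children):
    return {"lo": min(c["lo"] for c in children),
            "hi": max(c["hi"] for c in children),
            "children": list(children)}


def in_range(lo, hi):
    def decide(node):
        if lo <= node["lo"] and node["hi"] <= hi:
            return "accept"
        if node["hi"] < lo or node["lo"] > hi:
            return "prune"
        return "expand"
    return decide


tree = inner(inner(leaf(1), leaf(2)), inner(leaf(3), leaf(4)))
accepted, visited = traverse(tree, in_range(2, 4))
print(sorted(accepted), visited)  # -> [2, 3, 4] 5
```

Only five of the seven nodes are visited, because the subtree holding 3 and 4 is accepted as a
whole. For RkNN queries the decision function is far more delicate, which is the subject of the
rest of this section.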

Suppose the query point is denoted by Q, and suppose that a branch-and-bound
algorithm is currently visiting E during its traversal of the index. Let NN_L(E) denote
the (valid) lower NN-list of node E, and NN_U(E) denote its (valid) upper NN-list.
Also, suppose i is the smallest index such that $k \le \sum_{t=1}^{i} m_t$ for NN_L(E), and j is the
smallest such index for NN_U(E).

Here are the main theorems that give us sufficient conditions for accepting and prun¬ 
ing certain nodes in the index during a branch-and-bound traversal. 

Theorem 4.7 (Accepting and Pruning Condition).

(1) If $MaxSim(E, Q) < MinSim(E, E_i)$, then Q cannot have any point of E in its RkNN
set. Therefore, E can be pruned.

(2) If $MinSim(E, Q) > MaxSim(E, E_j)$, then all points in E belong to the RkNN set of Q,
and so E can be accepted.

The proofs for the two cases are immediate from Lemma 4.3 and Lemma 4.6, respectively.

Footnote 1: For accepting or pruning, in case there is a tie between similarities between the
query point and a database point, we tie-break in favour of points in the database. The
alternative approach requires a straightforward modification to the results in this subsection.

4.5. Algorithm

Now we will discuss the modified algorithm for finding reverse nearest neighbors on
spatial-textual objects. Our algorithm is a modification of the one proposed in [Lu et al.
2011], so we will mostly engage in highlighting the major changes. Like the original
algorithm, our algorithm uses the following data structures: a FIFO queue (U), a result
list (ROL), a candidate list (COL) and a pruned list (PEL). We use a FIFO queue instead
of a priority queue, as each entry needs to update its NN-list with every other entry
present in every list in order to ensure completeness of the lists. So, the order in which
other entries are added is irrelevant. We will frequently use NN-lists to refer to both
the upper and lower NN-lists of the corresponding entry.

As before, the algorithm initializes the lists and enqueues the root of the IUR-tree.
While the queue is not empty, an entry E is dequeued from the queue and its parent is
removed from its NN-list. The two key modifications we suggest are stated next. First,

Algorithm 2 RSTkNN (R: IUR-Tree root, Q: query)

 1: Output: All objects o, s.t. o ∈ RSTkNN(Q, k, R).
 2: Initialize a FIFO queue U, and lists COL, ROL, PEL;
 3: EnQueue(U, R);
 4: while U is not empty do
 5:     E ← DeQueue(U); //FIFO Queue
 6:     for each tuple (E'_i, num_i) ∈ NN_L(E) do
 7:         if E'_i = E or E'_i = Parent(E) then
 8:             remove (E'_i, num_i) from NN_L(E) and NN_U(E);
 9:         end if
10:     end for
11:     if E is an internal node then
12:         AddItself(E); //Ensure locality condition
13:     end if
14:     for each entry E' in U do //Ensure completeness condition
15:         Update_NN-list(E, E'); //mutual effect
16:         Update_NN-list(E', E); //mutual effect
17:     end for
18:     if E is not a hit or drop then
19:         if E is an index node then
20:             for each child C_E of E do
21:                 Inherit(NN_L(C_E), NN_L(E));
22:                 Inherit(NN_U(C_E), NN_U(E));
23:                 EnQueue(U, C_E);
24:             end for
25:         else
26:             COL.append(E);
27:         end if
28:     end if
29: end while
30: Final_Verification(COL, PEL, ROL, Q);


if E is an internal node of the tree, it adds itself to its NN-lists, thereby maintaining
the locality condition (line 12). Then E updates its NN-lists with each entry E' present
in the queue and vice versa. Updating the NN-list of E with every other entry in
the queue maintains the completeness condition (line 14). After this, IsHitOrDrop is
invoked to check if E can be pruned or added to the results. If E can neither be pruned
nor added to the results, its children are added to the queue if E is an internal node;
otherwise, E is added to the candidate list. We retain the optimisation of having
the children of E copy the NN-list of E before they are enqueued to U. When the queue
becomes empty, there might be some candidate points left in the candidate list. The
procedure Final_Verification is invoked to decide whether the points present in the
candidate list belong to the result list or the pruned list; this procedure essentially
checks every candidate point against the other entries.

We illustrate the working of our algorithm on the example presented earlier (Figure
3) in Table III. As expected, the algorithm now correctly returns P_0 and P_1 as the only
points in RSTkNN of Q.
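The accept/prune test of Theorem 4.7, which is what IsHitOrDrop applies, can be sketched as
follows. NN-lists are represented as lists of `(node, bound, count)` tuples in the style of the
contribution lists of Section 3. The figures 0.73 and 0.68 come from the example of Section 3.2;
the value 0.95 for MaxST(N_2, Q) and the self-entries `("N2", 1.0, 3)` and `("N2", 0.3, 3)` are
hypothetical numbers chosen only to show the effect of the locality condition.

```python
def bound_at_rank(nn_list, k):
    """nn_list: [(node, similarity_bound, count), ...] sorted by decreasing bound.

    Returns the bound of the first tuple at which the running count reaches k,
    or None if the list covers fewer than k points.
    """
    covered = 0
    for _node, bound, count in nn_list:
        covered += count
        if k <= covered:
            return bound
    return None


def is_hit_or_drop(min_sim_q, max_sim_q, nn_lower, nn_upper, k):
    """Theorem 4.7; nn_upper must be complete for 'accept' to be sound."""
    lower = bound_at_rank(nn_lower, k)   # MinSim(E, E_i)
    upper = bound_at_rank(nn_upper, k)   # MaxSim(E, E_j)
    if lower is not None and max_sim_q < lower:
        return "prune"
    if upper is not None and min_sim_q > upper:
        return "accept"
    return "undecided"


# Contribution lists of N_2 without the node itself (as in Section 3.2).
without_self = is_hit_or_drop(0.73, 0.95,
                              [("N1", 0.0, 2)], [("N1", 0.68, 2)], k=2)

# The same test once N_2 has added itself (hypothetical bounds and count).
with_self = is_hit_or_drop(0.73, 0.95,
                           [("N2", 0.3, 3), ("N1", 0.0, 2)],
                           [("N2", 1.0, 3), ("N1", 0.68, 2)], k=2)
print(without_self, with_self)  # -> accept undecided
```

The function itself cannot tell whether the upper list it is given is complete; guaranteeing that
is the job of the surrounding traversal.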


4.6. Proof of Correctness 

We will now give a formal proof of correctness of our algorithm. Essentially, we will 
show that, when an index node is checked (line 18) if it can be immediately accepted 

function Final_Verification(COL, PEL, ROL, Q)
    PEL = SubTree(PEL)
    while COL ≠ ∅ do
        for each point o in COL do
            for each point r in ROL do
                Update_NN-list(o, r);
            end for
            for each point p in PEL do
                Update_NN-list(o, p);
            end for
            for each point c' in COL − {o} do
                Update_NN-list(o, c');
            end for
            if IsHitOrDrop(o, Q) == true then
                COL = COL − {o};
            end if
        end for
    end while
end function


Table III: Trace of our algorithm

Step  Actions                                   U                   COL  ROL       PEL
----  ----------------------------------------  ------------------  ---  --------  -------------
1     Dequeue Root, Enqueue N_1, Enqueue N_2    N_1, N_2            ∅    ∅         ∅
2     Dequeue N_1                               N_2, P_0, P_1       ∅    ∅         ∅
3     Dequeue N_2                               P_0, P_1, N_3, N_4  ∅    ∅         ∅
4     Dequeue P_0                               P_1, N_3, N_4       ∅    P_0       ∅
5     Dequeue P_1                               N_3, N_4            ∅    P_0, P_1  ∅
6     Dequeue N_3                               N_4, P_2, P_3       ∅    P_0, P_1  ∅
7     Dequeue N_4                               P_2, P_3            ∅    P_0, P_1  N_4
8     Dequeue P_2                               P_3                 P_2  P_0, P_1  N_4
9     Dequeue P_3                               ∅                   P_2  P_0, P_1  N_4, P_3
10    Verify P_2                                ∅                   ∅    P_0, P_1  N_4, P_3, P_2


or pruned (using Theorem 4.7), its NN-lists (especially the upper NN-list) are complete 
(hence, valid). 

First, we make a few observations. The first is that if, at any point in time, 
a data point e not belonging to an entry E is covered in NN(E), then e remains covered 
in the NN-list of E from then on. Since e is covered at this instant, some ancestor E* 
of e must be present in the NN-list of E at that instant. Observe that once an entry E* 
is added to the NN-list of E, it is removed from the NN-list of E only when the NN-list 
of E is updated with the children of E* (lines 21, 22). This ensures that e is forever 
covered in the NN-list of E. 

Similarly, e remains covered in the NN-lists of all descendants of E. At line 
18 of the algorithm, if E can neither be added to the results nor pruned after updating 
its NN-list with each entry present in U, its children are added to the queue. However, 
each child of E inherits the NN-list of E, i.e., it simply copies that NN-list (lines 
21, 22). Therefore, the NN-lists of the children of E will also cover e. 
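
The two observations above can be made concrete with a toy model. The sketch below treats an NN-list simply as a set of entry names, ignoring distance bounds and the limit of k entries; a data point counts as covered when it, or one of its ancestors, appears in the set. The tree shape (including the children P4 and P5 of N4) and the helper names is_covered and refine are assumptions made for illustration only.

```python
# Hypothetical index tree as a child -> parent map.
# The children of N4 (P4, P5) are invented for this illustration.
PARENT = {
    "N1": "Root", "N2": "Root",
    "P0": "N1", "P1": "N1",
    "N3": "N2", "N4": "N2",
    "P2": "N3", "P3": "N3",
    "P4": "N4", "P5": "N4",
}


def ancestors_or_self(node):
    """Yield node, then its parent, and so on up to the root."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def is_covered(point, nn_list):
    """True when the point or one of its ancestors is an entry of nn_list."""
    return any(a in nn_list for a in ancestors_or_self(point))


def children(node):
    """All direct children of node in the toy tree."""
    return {c for c, p in PARENT.items() if p == node}


def refine(nn_list, entry):
    """Replace an internal entry by all of its children."""
    return (nn_list - {entry}) | children(entry)


nn_of_E = {"N1", "N2"}            # P2 is covered through its ancestor N2
refined = refine(nn_of_E, "N2")   # N2 is replaced by N3 and N4
inherited = set(refined)          # a child of E copies the NN-list of E
```

In this model an internal entry leaves the set only when all of its children enter it, so a point that was covered before the replacement is still covered afterwards, and a copied list covers exactly what the original covers.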

Now we present the key lemma for our proof of correctness. 
Fig. 4: Indexing tree 


LEMMA 4.8. The upper NN-list of every entry E, which is dequeued from the queue, 
is complete after line 17 of the RSTkNN algorithm. 

PROOF. Consider an execution of the algorithm, and let E denote the node that is 
currently dequeued from the queue. Let e be any data point and let P denote 
the path from the root to e in the tree. We will prove that after line 17 of the 
algorithm, e is covered by NNu(E). There are two possibilities (see Figure 4 for reference): 

Case A. e belongs to the subtree of E 
Case B. e does not belong to the subtree of E 

Case A is trivial. If e belongs to the subtree of E, it will be covered by NNu(E) after 
line 17, since any internal node adds itself to its NN-lists (line 12). 

Let us now consider Case B. Let t1 be the time when E is dequeued from the queue. 
One of the following four possibilities must be true at t1. 

Case B.1. Some node E_e on the path P belongs to the result list ROL. 

Case B.2. Some node E_e on the path P belongs to the pruned list PEL. 

Case B.3. Some node E_e on the path P belongs to the queue U. 

Case B.4. e belongs to the candidate list COL. 

Case B.1. Let t0 denote the time when line 17 was encountered after E_e was 
dequeued. Once again, there are two possibilities. 

Case: E belongs to the queue at t0. In this case, NNu(E) will contain E_e through 
mutual effect (line 16) at t0. This implies that e is covered by NNu(E) at t1. 

Case: E does not belong to the queue at t0. If E does not belong to the queue, it 
implies that there exists some ancestor of E, say E* (which cannot be E_e because of 
the condition of Case B.1), which belongs to the queue at time t0. Then NNu(E*) contains 
E_e through mutual effect (line 16). Once NNu(E*) covers e, the upper NN-lists of all 
its descendant nodes, and in particular that of E, also cover e, since children inherit 
the NN-list of their parent. 

The proofs for Case B.2 and Case B.3 are similar to that of Case B.1. 

We now consider the remaining Case B.4. Since e ∈ COL, some ancestor E* of e was 
dequeued from the queue prior to t1. All the nodes present in the queue at that time 
then contained E* in their upper NN-lists through mutual effect. Therefore, at t1, 
E contains E* in its upper NN-list. □ 
THEOREM 4.9. Given an integer k, a query point Q, and an index tree R, Algorithm 2 
correctly returns all RSTkNN points. 

Proof. The correctness follows from the following observations that were made 
earlier. 


— Internal nodes are accepted or pruned (by IsHitOrDrop) only when the sufficient 
conditions of Theorem 4.7 are met (using Lemma 4.8). 

— For the data points left in the candidate list COL, in Final_Verification, the 
(complete) NN-lists of every such point are updated with every other object (present 
in the candidate, result and pruned lists) before IsHitOrDrop is called on the point 
to directly accept or prune it. Our Final_Verification routine implements this in a 
rather straightforward manner. In line 32 of this routine, internal nodes present 
in PEL are replaced with their contained points to ensure that operations in this 
routine directly involve points. 


□ 

5. CONCLUSION AND FUTURE WORK 

RkNN is an important problem in facility location, operations research, clustering and 
other domains. We observed that a few published algorithms are not fully correct. In 
this paper we presented a correct algorithm to compute RkNN on a general data set 
organised as a tree. We first discussed counter-examples to illustrate where the earlier 
algorithms made an error, and then discussed the necessity of maintaining the locality 
and completeness conditions for ensuring the correctness of results. We finished by 
modifying one of the proposed algorithms and explaining why our algorithm 
is correct. 

In the future, we would like to extend our algorithm to bichromatic RSTkNN 
queries. We would further like to develop algorithms for settings where the objects 
are dynamic (e.g., moving in space, or with textual attributes being updated). 